Identifying Commodity Names Based on XGBoost Model
Xiaofeng Li1, Jing Ma1, Chi Li2, Hengmin Zhu3
1(College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China) 2(Alibaba Zhejiang Rookie Supply Chain Management Co., Ltd., Hangzhou 311100, China) 3(College of Economics and Management, Nanjing University of Posts and Telecommunications, Nanjing 210046, China)
[Objective] This paper aims to automatically identify commodity names from product descriptions in order to classify items sold on Taobao. [Methods] First, we retrieved a large number of transaction records from Taobao. Second, we built an e-commerce commodity description dataset and labeled it manually. Third, we developed a supervised machine learning algorithm based on the XGBoost model to extract commodity names from product descriptions. [Results] The precision and recall of the algorithm were 85% and 87%, respectively, for 816 different items from 20,059 records. [Limitations] The categories of commodities in the test corpus need to be expanded. [Conclusions] Machine learning algorithms are an effective way to identify product names.
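To make the reported metrics concrete, the following is a minimal sketch of how precision and recall could be computed for name extraction. It assumes an exact-match criterion between the extracted name and the manually labeled name, which the abstract does not specify; the function name `precision_recall` and the toy records are hypothetical and are not taken from the paper's dataset.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for extracted commodity names.

    predicted, gold: equal-length lists with one entry per record.
    An entry is a string (a name) or None (no name extracted / labeled).
    Assumption: a prediction counts as correct only on exact match.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    # Correct extractions: a name was predicted and equals the label.
    true_positive = sum(
        1 for p, g in zip(predicted, gold) if p is not None and p == g
    )
    n_predicted = sum(1 for p in predicted if p is not None)
    n_gold = sum(1 for g in gold if g is not None)

    precision = true_positive / n_predicted if n_predicted else 0.0
    recall = true_positive / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical toy records, for illustration only.
gold = ["dress", "phone case", "kettle", "socks"]
predicted = ["dress", "phone", "kettle", None]

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f}, recall={r:.2f}")
```

In this toy case two of the three extracted names are correct and two of the four labeled names are recovered, so precision and recall differ; this is why the abstract reports the two figures separately.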